C-Pack: Packaged Resources To Advance General Chinese Embedding
We introduce C-Pack, a package of resources that significantly advance the
field of general Chinese embeddings. C-Pack includes three critical resources.
1) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6
tasks and 35 datasets. 2) C-MTP is a massive text embedding dataset curated
from labeled and unlabeled Chinese corpora for training embedding models. 3)
C-TEM is a family of embedding models covering multiple sizes. Our models
outperform all prior Chinese text embeddings on C-MTEB by up to 10% at the
time of release. We also integrate and optimize the entire suite of
training methods for C-TEM. Along with our resources on general Chinese
embedding, we release our data and models for English text embeddings. The
English models achieve state-of-the-art performance on the MTEB benchmark;
meanwhile, our released English data is 2 times larger than the Chinese data.
All these resources are made publicly available at
https://github.com/FlagOpen/FlagEmbedding
Hydrological variations in central China over the past millennium and their links to the Tropical Pacific and North Atlantic Oceans
Variations in precipitation, known as the Meiyu rain, in the East Asian summer monsoon (EASM) domain during the last millennium could help illuminate the hydrological response to future global warming. Here we present a precisely dated and highly resolved stalagmite δ18O record from Yongxing Cave, central China. Our new record, combined with a previously published one from the same cave, indicates that the Meiyu rain has changed dramatically in association with global temperature change. In particular, our record shows that the Meiyu rain weakened during the Medieval Climate Anomaly (MCA) but intensified during the Little Ice Age (LIA). During the Current Warm Period (CWP), our record indicates a similar weakening of the Meiyu rain. Furthermore, during the MCA and CWP, our records show that the atmospheric precipitation is similarly wet in northern China and similarly dry in central China, but relatively wet during the CWP in southern China. This spatial discrepancy indicates a complicated localized response of the regional precipitation to the anthropogenic forcing. The weakened (intensified) Meiyu rain during the MCA (LIA) matches well with the warm (cold) phases of Northern Hemisphere surface air temperature. This Meiyu rain pattern also corresponds well with the climatic conditions over the Tropical Indo-Pacific warm pool. On the other hand, our record shows a strong association with the North Atlantic climate as well. The reduced (increased) Meiyu rain correlates well with positive (negative) phases of the North Atlantic Oscillation. In addition, our record links well with the strong (weak) Atlantic meridional overturning circulation during the MCA (LIA) period. All of the above-mentioned localized correspondences and remote teleconnections on decadal to centennial timescales indicate that the Meiyu rain is coupled closely with oceanic processes in the Tropical Pacific and North Atlantic Oceans during the MCA and LIA.
Probabilistic hesitant fuzzy multiple attribute decision-making based on regret theory for the evaluation of venture capital projects
The selection of venture capital investment projects is one of the
most important decision-making activities for venture capitalists.
Due to the complexity of the investment market and the limited cognition
of people, most venture capital investment decision
problems are highly uncertain and venture capitalists are
often boundedly rational under uncertainty. To address such problems,
this article presents an approach based on regret theory to
probabilistic hesitant fuzzy multiple attribute decision-making.
Firstly, when the information on the occurrence probabilities of
all the elements in the probabilistic hesitant fuzzy element
(P.H.F.E.) is unknown or partially known, two different mathematical
programming models based on water-filling theory and the
maximum entropy principle are provided to handle these complex
situations. Secondly, to capture the psychological behaviours
of venture capitalists, the regret theory is utilised to solve the
problem of selection of venture capital investment projects.
Finally, a comparative analysis with existing approaches is conducted
to demonstrate the feasibility and applicability of the proposed
method.
Wasserstein distance-based probabilistic linguistic TODIM method with application to the evaluation of sustainable rural tourism potential
The evaluation of sustainable rural tourism potential is a key task
in sustainable rural tourism development. Due to the complexity
of the rural tourism development situation and the limited cognition of people, most of the assessment problems for sustainable
rural tourism potential are highly uncertain, which brings challenges to the characterisation and measurement of evaluation
information. Besides, decision-makers (DMs) usually do not exhibit
complete rationality in the practical evaluation process. To tackle
such problems, this paper proposes a new behavioural multi-attribute group decision-making (MAGDM) method with probabilistic
linguistic term sets (PLTSs) by integrating a Wasserstein distance
measure into TODIM (an acronym in Portuguese of interactive
and multicriteria decision making) method. Firstly, a new
Wasserstein-based distance measure with PLTSs is defined, and
some properties of the proposed distance are developed.
Secondly, based on the correlation coefficient among attributes
and standard deviation of each attribute, an attribute weight
determination method (called PL-CRITIC method) is proposed.
Subsequently, a Wasserstein distance-based probabilistic linguistic
TODIM method is developed. Finally, the proposed method is
applied to the evaluation of sustainable rural tourism potential,
along with sensitivity and comparative analyses, as a means of
illustrating the effectiveness and advantages of the new method.
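As a rough illustration of the kind of distance this abstract refers to, the sketch below computes the standard one-dimensional Wasserstein-1 distance between two probability distributions over an ordered set of linguistic terms. It is not the measure defined in the paper: the mapping of terms to equally spaced integer positions, the function name `wasserstein_1d`, and the example distributions are all assumptions made for illustration.

```python
from itertools import accumulate


def wasserstein_1d(p, q, tol=1e-9):
    """Wasserstein-1 distance between two distributions over the same
    ordered terms, assuming term k sits at integer position k (hypothetical
    encoding, not taken from the paper)."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same terms")
    if abs(sum(p) - 1.0) > tol or abs(sum(q) - 1.0) > tol:
        raise ValueError("each distribution must sum to 1")
    # With unit spacing, W1 is the sum of absolute differences between the
    # two cumulative distributions at every position except the last.
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    return sum(abs(a - b) for a, b in zip(cdf_p[:-1], cdf_q[:-1]))


# Hypothetical assessments over three ordered terms: low, medium, high.
mostly_low = [0.5, 0.5, 0.0]
mostly_high = [0.0, 0.5, 0.5]
distance = wasserstein_1d(mostly_low, mostly_high)
```

Unlike a term-by-term comparison of probabilities, this distance grows with how far probability mass has to move along the term scale, which is one reason a transport-based measure can suit ordered linguistic terms.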
Retrieve Anything To Augment Large Language Models
Large language models (LLMs) face significant challenges stemming from their
inherent limitations in knowledge, memory, alignment, and action. These
challenges cannot be addressed by LLMs alone, but require assistance
from the external world, such as knowledge bases, memory stores, demonstration
examples, and tools. Retrieval augmentation stands as a vital mechanism for
bridging the gap between LLMs and external assistance. However,
conventional methods encounter two pressing issues. On the one hand, the
general-purpose retrievers are not properly optimized for the retrieval
augmentation of LLMs. On the other hand, the task-specific retrievers lack the
required versatility, hindering their performance across the diverse retrieval
augmentation scenarios.
In this work, we present a novel approach, the LLM-Embedder, which
comprehensively supports the diverse retrieval augmentation needs of LLMs with
one unified embedding model. Training such a unified model is non-trivial, as
various retrieval tasks aim to capture distinct semantic relationships, often
subject to mutual interference. To address this challenge, we systematically
optimize our training methodology. This includes reward formulation based on
LLMs' feedback, the stabilization of knowledge distillation, multi-task
fine-tuning with explicit instructions, and homogeneous in-batch negative
sampling. These optimization strategies contribute to the outstanding empirical
performance of the LLM-Embedder. Notably, it yields remarkable enhancements in
retrieval augmentation for LLMs, surpassing both general-purpose and
task-specific retrievers in various evaluation scenarios. Our checkpoint and
source code are publicly available at
https://github.com/FlagOpen/FlagEmbedding
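To make the idea of in-batch negative sampling concrete, here is a minimal sketch of a contrastive loss in which each query's paired passage is the positive and every other passage in the batch serves as a negative. It is a generic illustration, not the LLM-Embedder training code: the function name, the temperature value, and the toy vectors are assumptions, and the "homogeneous" variant named in the abstract is not reproduced here.

```python
import math


def in_batch_contrastive_loss(queries, passages, temperature=0.05):
    """Mean cross-entropy loss where passages[i] is the positive for
    queries[i] and all other passages in the batch are negatives."""
    if len(queries) != len(passages) or not queries:
        raise ValueError("need a non-empty batch of aligned pairs")

    def normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        if norm == 0.0:
            raise ValueError("cannot normalize a zero vector")
        return [x / norm for x in vector]

    q = [normalize(v) for v in queries]
    p = [normalize(v) for v in passages]

    total = 0.0
    for i, query in enumerate(q):
        # Cosine similarity to every passage in the batch, scaled by temperature.
        logits = [sum(a * b for a, b in zip(query, passage)) / temperature
                  for passage in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(q)


# Toy batch: aligned pairs should score a much lower loss than swapped pairs.
toy_queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = in_batch_contrastive_loss(toy_queries, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = in_batch_contrastive_loss(toy_queries, [[0.0, 1.0], [1.0, 0.0]])
```

Reusing the rest of the batch as negatives avoids a separate negative-mining pass, at the cost of negatives whose difficulty depends on what happens to share the batch.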